\section{Database and method}

In this section, we describe the two datasets used in our experiments. The first dataset is based on
WALS~\cite{haspelmath2008wals} and the Automated Similarity Judgment Program (\citealt{wichmannasjp14}; ASJP). The WALS
dataset\footnote{Accessed on 2011-09-22.} has 144 feature types for 2676 languages distributed across the globe. As noted
by~\cite{hh13}, the WALS database is sparse across many language families of the world, and the dataset needs pruning
before it is used for further investigations. The database is represented as a matrix of languages vs. features. The
pruning has to be done in both directions (languages and features) to prevent sparsity when computing the pair-wise
distances between languages. Following~\citet{georgi2010comparing}, we remove all languages that have fewer than 25
attested features. Further, we remove all features that are attested in fewer than 10\% of the languages. This leaves
the dataset with 1159 languages and 193 features. \citet{georgi2010comparing} work with a pruned dataset of 735
languages and two major families, Indo-European and Sino-Tibetan, whereas we restrict our investigation of the three
questions to the well-defined language families given in Table~\ref{tab:tab1}. Our dataset retains only those families
that have more than 10 languages (following~\citet{wichmann2010evaluating}); all families that fall below this
threshold are removed.

\begin{table}[htb]
\centering
\begin{tabular}{|lc|lc|}
\hline
Family & Count & Family & Count \\
\hline
Austronesian  &  150 (141) & Austro-Asiatic  &  22 (21) \\
Niger-Congo  &  143 (123) & Oto-Manguean  &  18 (14) \\
Sino-Tibetan  &  81 (68) & Arawakan  &  17 (17) \\
Australian  &  73 (65) & Uralic  &  15 (12) \\
Nilo-Saharan  &  69 (62) & Penutian  &  14 (11) \\
Afro-Asiatic  &  68 (57) & Nakh-Daghestanian  &  13 (13) \\
Indo-European  &  60 (56) & Tupian  &  13 (12) \\
Trans-New Guinea  &  43 (33) & Hokan  &  12 (12) \\
Uto-Aztecan  &  28 (26) & Dravidian  &  10 (9) \\
Altaic  &  27 (26) & Mayan  &  10 (7) \\
\hline
\end{tabular}
\caption{Number of languages in each family. The number in parentheses gives the number of languages of that family
that remain in the database after mapping to the ASJP database.}
\label{tab:tab1}
\end{table}

Each feature in the WALS dataset is either a binary feature, recording the presence or absence of the feature in a
language, or a multi-valued feature, coded as discrete integers over a finite range. \citet{georgi2010comparing}
binarize the feature values by recording the presence or absence of each feature value in a language. This binarization
greatly expands the length of the feature vector for a language but makes it possible to represent a wide-ranged
feature such as \emph{word order} (which has 7 feature values) as a sequence of 1's and 0's. The issue of binary vs.
multi-valued features has been a point of debate in genetic linguistics, and the choice has been shown not to give very
different results for the Indo-European classification~\cite{Atkinson:06}.
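
As an illustration of this binarization (the notation is introduced here for exposition and is not taken
from~\citet{georgi2010comparing}), a feature with $k$ possible values can be expanded into $k$ binary features. A
language that has value $v \in \{1, \dots, k\}$ for that feature is then represented by the vector
\[
\mathbf{b}(v) = (b_1, \dots, b_k), \qquad
b_j = \left\{ \begin{array}{ll} 1 & \mbox{if } j = v,\\ 0 & \mbox{otherwise.} \end{array} \right.
\]
For a seven-valued feature such as \emph{word order}, a language with the second value would thus be coded as
$(0, 1, 0, 0, 0, 0, 0)$. How unattested features are coded is not specified by this sketch.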

Apart from typological information for the world's languages, WALS also provides a two-level classification of each
language family. In the WALS classification, the top level is the family name, the next level is the genus, and the
language rests at the bottom. For example, the Indo-European family has 9 genera. A genus is a consensually defined unit
and is not a rigorously established genealogical unit. Rather, a genus corresponds to a group of languages which are
supposed to have descended from a proto-language which is about 3500 to 4000 years old. For instance, WALS lists the
Indic and Iranian languages as separate genera, whereas both genera are descendants of the Indo-Iranian subgroup, which
in turn descended from Proto-Indo-European; a fact well-known in historical linguistics. The WALS classification of
each language family listed in Table~\ref{tab:tab1} can be represented as a 2D-matrix with languages along both rows
and columns. Each cell of such a matrix represents the relationship of a language pair in the family. The cell has 0 if
the languages belong to the same genus and 1 if they belong to different genera. The pair-wise distance matrix obtained
from each similarity measure is compared to the 2D-matrix using a special case of Pearson's $r$, called point-biserial
correlation.
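
For reference, the point-biserial correlation can be written as follows. The symbols are introduced here for
illustration, and we assume that the comparison runs over the $n$ unordered language pairs of a family. Let
$d_1, \dots, d_n$ be the pair-wise distances with mean $\bar{d}$, let $M_1$ and $M_0$ be the mean distances of the
pairs whose cell in the 2D-matrix is 1 (different genera) and 0 (same genus), respectively, and let $n_1$ and $n_0$ be
the corresponding numbers of pairs, with $n = n_0 + n_1$. Then
\[
r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}, \qquad
s_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - \bar{d})^2}.
\]
A positive $r_{pb}$ indicates that language pairs from different genera tend to be farther apart than pairs from the
same genus.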

In a recent large-scale effort by~\cite{brown2008automated}, an international consortium of scholars (calling themselves
ASJP) started collecting Swadesh word lists~\cite{swadesh1952lexico} (short lists of concept meanings, usually ranging
from 40 to 200 items) for the languages of the world (more than 58\%) in the hope of automating the classification of
the world's languages.\footnote{Available at: \url{{http://email.eva.mpg.de/~wichmann/listss14.zip}}}
This database has word lists for a language (given by its unique ISO 639-3 code as well as its WALS code) and its
dialects. We use the WALS code to map the languages in the WALS database to those in the ASJP database. Whenever a
language with a WALS code has more than one word list in the ASJP database, we arbitrarily retain the first word list
for our experiments. An excerpt of the word list for Russian is shown in Figure~\ref{fig:fig1}. The first line consists
of the name of the language, the WALS classification (Indo-European family and Slavic genus), followed by the Ethnologue
classification (informing that Russian belongs to the Eastern Slavic subgroup of the Indo-European family).

\begin{figure}[h!]
\centering
\includegraphics[height=0.3\textwidth, width=0.4\textwidth]{asjp_sample.png}
\caption{Ten lexical items from the ASJP word list for Russian.}
\label{fig:fig1}
\end{figure}

The ASJP program computes the distance between two languages as the average pair-wise length-normalized Levenshtein
distance~\citep{levenshtein1965binary}, called Levenshtein Distance Normalized (LDN). LDN is further modified
to account for chance resemblance, such as accidental phoneme inventory similarity between a pair of languages,
to yield LDND (Levenshtein Distance Normalized
Divided;~\citeauthor{holman2008proceedings}~\citeyearpar{holman2008proceedings}). The performance of LDND distance
matrices was evaluated against two expert classifications of the world's
languages in at least two recent works~\cite{pompei2011accuracy,wichmann2012correlates} and was found to largely agree
with the classification given by historical linguists. This finding gives us strong grounds to use ASJP lexical
similarity as a measure of pair-wise lexical divergence.
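
The two measures can be sketched as follows. The notation is introduced here for exposition, and the formulation
follows the usual description of LDND in the ASJP literature rather than any detail stated in this section. For two
word forms $a$ and $b$ with Levenshtein distance $\mathrm{LD}(a, b)$ and lengths $|a|$ and $|b|$,
\[
\mathrm{LDN}(a, b) = \frac{\mathrm{LD}(a, b)}{\max(|a|, |b|)}.
\]
For two languages $A$ and $B$ with word forms $a_i$ and $b_i$ for each of $n$ shared meanings,
\[
\mathrm{LDN}(A, B) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{LDN}(a_i, b_i), \qquad
\Gamma(A, B) = \frac{1}{n(n-1)} \sum_{i \neq j} \mathrm{LDN}(a_i, b_j),
\]
\[
\mathrm{LDND}(A, B) = \frac{\mathrm{LDN}(A, B)}{\Gamma(A, B)}.
\]
The denominator $\Gamma(A, B)$ averages over word pairs with different meanings and therefore serves as an estimate of
the distance expected from chance resemblance alone.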

\begin{figure}[h!]
\centering
\includegraphics[height=0.3\textwidth, width=0.6\textwidth]{wals-langs.png}
\caption{WALS languages in the final dataset.}
\label{fig:fig2}
\end{figure}

